Add optional DuckDB connectors prototype (waterdata + wqp)#241

Closed
thodson-usgs wants to merge 2 commits into DOI-USGS:main from thodson-usgs:worktree-duckdb-connector

Conversation

@thodson-usgs
Collaborator

Summary

Adds dataretrieval.duckdb_connectors, an optional extension that wraps DuckDB connections with helper methods exposing the waterdata (OGC) and wqp endpoints as registerable SQL views. Each helper returns a duckdb.DuckDBPyRelation; pass register_as=<name> to also publish the result as a named view that subsequent SQL can reference.

This is a prototype / draft — opening for early feedback on shape and coverage before any release. Status: 15 tests passing, ruff check + ruff format --check clean, demo notebook executes end-to-end against the live API.

Why

Several use cases get noticeably more ergonomic in SQL than in pandas:

  • Joining heterogeneous endpoints — site metadata, daily values, time-series metadata, water-quality results all live behind separate getters; once registered as views they JOIN cleanly.
  • Window functions / time aggregations — top-N flow days per gauge, monthly means, rolling windows.
  • Compose with files on disk — DuckDB reads Parquet/CSV/S3 natively, so cached water data can be joined with external datasets without leaving SQL.

What's in the box

from dataretrieval.duckdb_connectors import waterdata, wqp

with waterdata.connect() as wd:
    wd.monitoring_locations(state_name="Illinois", register_as="sites")
    wd.daily(monitoring_location_id=["USGS-05586100"], parameter_code="00060",
             time="2023-01-01/2023-12-31", register_as="daily")
    wd.sql("""
        SELECT s.monitoring_location_name, avg(d.value) AS mean_cfs
        FROM sites s JOIN daily d USING (monitoring_location_id)
        GROUP BY 1 ORDER BY 2 DESC
    """)

Layout

dataretrieval/duckdb_connectors/
├── _base.py         # _require_duckdb, _flatten_geometry, _BaseConnection
├── waterdata.py     # WaterdataConnection + connect()
└── wqp.py           # WQPConnection + connect(legacy=, ssl_check=)

dataretrieval/duckdb_connector.py (singular) is preserved as a backward-compat alias re-exporting the waterdata connector.

Endpoints exposed

  • waterdata: monitoring_locations, daily, continuous, time_series_metadata, latest_continuous, latest_daily, field_measurements, samples
  • wqp: get_results, what_sites, what_organizations, what_projects, what_activities, what_detection_limits, what_habitat_metrics, what_activity_metrics (connection holds a legacy / ssl_check default; per-call overrides supported)
  • Generic: register_table(name, fn, **kwargs) for any (DataFrame, metadata)-returning getter not yet wrapped.
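
The "connection holds a default; per-call overrides supported" behaviour noted for the wqp connector could be sketched roughly as follows. This is not the PR's implementation: `DefaultsConnection`, `_call`, and `fake_what_sites` are hypothetical names, and the stub getter only echoes its arguments so the merge order is visible.

```python
def fake_what_sites(**kwargs):
    """Hypothetical stub getter: returns the keyword arguments it received."""
    return kwargs


class DefaultsConnection:
    """Hypothetical stand-in for a connection that stores call defaults."""

    def __init__(self, legacy=True, ssl_check=True):
        # Connection-level defaults applied to every helper call.
        self._defaults = {"legacy": legacy, "ssl_check": ssl_check}

    def _call(self, getter, **kwargs):
        # Later keys win in a dict merge, so per-call arguments
        # override the connection-level defaults.
        merged = {**self._defaults, **kwargs}
        return getter(**merged)

    def what_sites(self, **kwargs):
        return self._call(fake_what_sites, **kwargs)


conn = DefaultsConnection(legacy=False)
default_call = conn.what_sites(statecode="US:17")
override_call = conn.what_sites(statecode="US:17", legacy=True)
print(default_call)
print(override_call)
```

Keeping the merge in one private method means each wrapped endpoint stays a one-line delegation.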

Optional dependencies

Added to pyproject.toml:

duckdb  = ["duckdb>=1.0.0"]
spatial = ["dataretrieval[nldi]", "dataretrieval[duckdb]"]   # compound extra

The DuckDB spatial extension is a runtime C++ binary that DuckDB downloads on first INSTALL spatial — not a pip package. Pass spatial=True to connect() to install + load it; registered geometry columns (stored as WKT) can then be parsed with ST_GeomFromText.
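
For the pip-level extras above, an import guard in the spirit of `_require_duckdb` might look like the sketch below. The function name `require_optional` and the error wording are assumptions for illustration; the PR's actual helper may differ.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency or raise an error naming the pip extra.

    Hypothetical helper: the name and message are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with `pip install dataretrieval[{extra}]`"
        ) from exc


# A module that exists is returned as-is.
json_module = require_optional("json", "duckdb")
print(json_module.__name__)
```

Deferring the import to call time keeps `import dataretrieval` working for users who never install the extra.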

pyproject.toml tool.ruff.target-version was bumped to py310 (waterdata tests already gate at 3.10).

Geometry handling

Default: GeoDataFrame geometry is flattened to WKT plus longitude/latitude columns so the connectors work without the spatial extension. With spatial=True, native ST_* functions are available against the same WKT column.
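
To make the flattened shape concrete, here is a dependency-free sketch that turns a GeoJSON-style point feature into a row with a WKT string plus longitude/latitude. It is only an approximation of the idea: the PR operates on GeoDataFrame geometry, whereas this stand-in uses plain dicts, handles only `Point`, and the function name `flatten_feature` is hypothetical.

```python
def flatten_feature(feature):
    """Flatten a GeoJSON-style feature into a plain row (Point geometries only)."""
    row = dict(feature.get("properties", {}))
    # Default to None so every row has the same columns.
    row["geometry"] = None
    row["longitude"] = None
    row["latitude"] = None

    geom = feature.get("geometry")
    if geom and geom.get("type") == "Point":
        lon, lat = geom["coordinates"][:2]
        # WKT puts x (longitude) first, then y (latitude).
        row["geometry"] = f"POINT ({lon} {lat})"
        row["longitude"] = lon
        row["latitude"] = lat
    return row


row = flatten_feature(
    {
        "properties": {"monitoring_location_id": "USGS-05586100"},
        "geometry": {"type": "Point", "coordinates": [-90.5, 39.7]},
    }
)
print(row)
```

Because the geometry ends up as an ordinary string column, the row can be registered without the spatial extension and parsed later with `ST_GeomFromText` when `spatial=True` is used.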

Demo

demos/duckdb_waterdata_demo.ipynb walks through site discovery, daily values, monthly aggregation, top-N window functions, sites×daily join, latest readings, a cross-source waterdata × WQP join, and the spatial-extension affordance.

Test plan

  • pytest tests/duckdb_connectors_waterdata_test.py tests/duckdb_connectors_wqp_test.py — 15 passing (mocked at the getter boundary; spatial test skips if extension can't be fetched)
  • ruff check + ruff format --check clean
  • Live-tested every notebook section against api.waterdata.usgs.gov + waterqualitydata.us with an API token
  • Backward-compat: from dataretrieval import duckdb_connector; duckdb_connector.connect() still works
  • Decide whether NOTES.md (design notes captured during prototyping) belongs in the repo or should be dropped before merge
  • Decide whether to widen coverage to nldi / ngwmn

Known follow-ups

  • Hot-path import geopandas cleanup — already applied in this PR; will look for analogous spots elsewhere in the codebase in a follow-up branch.
  • DuckDB ST_Distance_Sphere axis convention surprised me during the demo; switched the demo to ST_Within with a polygon, which is unambiguous. Worth filing upstream once confirmed against duckdb-spatial docs.

🤖 Generated with Claude Code

thodson-usgs and others added 2 commits April 25, 2026 17:58
Adds dataretrieval.duckdb_connectors, an optional extension that wraps
DuckDB connections with helper methods exposing the dataretrieval
waterdata (OGC) and wqp endpoints as registerable SQL views. Each
helper returns a duckdb.DuckDBPyRelation; pass register_as=<name> to
also publish the result as a named view that subsequent SQL can
reference.

Highlights:
* Per-source layout (duckdb_connectors/{waterdata,wqp}.py) sharing a
  thin _BaseConnection in _base.py
* Optional dependency: pip install dataretrieval[duckdb]
* Compound spatial extra (pip install dataretrieval[spatial]) bundles
  geopandas + duckdb; spatial=True flag on connect() runs
  INSTALL spatial; LOAD spatial on the underlying connection so
  ST_GeomFromText etc. become available against registered views
* Geometry handled by converting GeoDataFrame geometry to WKT plus
  longitude/latitude columns so the prototype works without the
  spatial extension by default
* WQP connector threads connection-level legacy / ssl_check defaults
  through to every helper; per-call overrides supported
* dataretrieval.duckdb_connector preserved as a backward-compat alias
  for the waterdata connector
* Demo notebook covering site discovery, daily values, monthly
  aggregation, top-N window functions, sites x daily joins, latest
  readings, cross-source waterdata x WQP joins, and the spatial flag

15 tests pass; ruff check + format clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Narrow `_flatten_geometry` exception from `Exception` to `ValueError`
  (the actual error geopandas raises on `.x`/`.y` for non-Point
  geometries) so genuine bugs aren't swallowed.
* Drop the `if TYPE_CHECKING: import duckdb as _duckdb` block in
  `_base.py`. The runtime `try: import duckdb` is enough for type
  checkers; the alias was dead code.
* Parametrize `test_what_endpoint_invokes_correct_underlying` so each
  WQP helper gets its own test case (matches the parametrize pattern
  already used in `tests/waterdata_utils_test.py`).

20 tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elbeejay
Contributor

I get the overall idea here, but I think there's some value in revisiting the overall purpose and scope of the package. My understanding was that dataretrieval is meant to be an atomic Python API wrapper for USGS water data. Every feature added beyond that base functionality increases the maintenance burden. I wonder if something like DuckDB support could be a separate library that wraps the core dataretrieval.

@thodson-usgs
Collaborator Author

Nice to hear from you, @elbeejay. I agree with you, and I'm leaning against merging this PR. I was curious whether DuckDB might be useful, but quickly realized that the query language isn't all that important when a bot is writing your queries.
